Introduction

Recent upward trends in natural disasters have caused incalculable loss of property, economic well-being, and life to millions all across the globe, especially in poor and underdeveloped nations. From fires in California and tornadoes in Kentucky, to flash floods in India and Nepal and a devastating earthquake in Haiti, natural disasters have become one of the greatest threats to a peaceful existence on earth\(^{[4]}\). It is natural to question the role of climate change and its precursors such as rising temperatures and emissions in the preponderance of such maleficent events. An understanding of the causes of the recent influx of natural disasters as well as potential solutions necessitate an investigation into recent trends in climate data.

The motivation for this report is to provide a better understanding of the link between drivers and correlating factors of climate change and natural disaster patterns. We will observe data from 1900 to the present in order to detect trends and predict future outcomes. Descriptive and inferential analyses will be performed on various types and quantities of natural disasters over the factors of time, global average land and ocean temperatures, and \(CO_2\) and methane emissions. In performing these analyses we hope to discover any correlation between natural disasters and high vs. low temperatures and emission levels and the establish a trend that will help us know what to expect in the future. The scope of this paper is to assess factors that cause or are correlated with the characteristics of natural disasters. Future studies should make use of the information here to establish best solutions to the problem of increasing natural disaster occurrences.

Description of Data

This report utilizes three datasets. The first contains information about natural disasters occurring globally from 1900 to 2021. We utilize the following variables:

The second dataset contains land and ocean average temperatures from 1899 to 2015. We utilize the following variables:

We average the temperature data over 12-month periods to get data at yearly granularity to match our other datasets.

The third dataset contains emissions data for each year from 1750 to 2020. We utilize the following variables:

Methods

Various statistical methods were used to present our data in an informative way, as well as perform analysis on historical data and make future predictions.

Graphical representations of our data, such as histograms, scatterplots, and curvilinear regression plots are used throughout the report. We used linear regression modeling to fit a relationship between average land temperatures and quantity of natural disasters. Time series analyses were performed to observe trends in temperature over time and make predictions about the future. We performed ANOVA to determine whether or not different types of natural disasters occurred at different temperatures or different levels of emissions. Finally, we performed a QDA analysis to see if we could accurately predict which type of disasters would be most likely to occur under which temperature/emissions conditions.

All of the coding for the project was done in R using RStudio. Plots were made using Base R, ggplot2, and plotly. We used several packages including dplyr for data manipulations, forecast for time series analysis, and MASS and caret for supervised classification modeling.

Results and Analysis

Relationship between Land Average Temperature and Quantity of Natural Disasters

We see that the yearly number of global natural disasters increases roughly exponentially from 1910 to 2010 before finally falling down a bit in the 2010s. The number appears low for the 2020s but we are only two years into this period, and the data captured only goes through 2021.

Looking at data more locally, we see that California fires have mostly been on an upward trend but not at an exponential rate like total global natural disasters. It is noteable that there are no data on fires in California from before 1970 and from 1975 to 1990. Since this is obviously not the case in actuality, we can postulate that data on fires in California is incomplete.

This graphic includes a scatterplot of the global land average temperatures for each year as well as a regression line drawn through them with the shaded region representing a 95% confidence interval. There is a clear upward trend in average temperature over time since 1900, which probably comes with little surprise. One interesting thing to point out is the massive outlier around the year 1900. It is worth looking into whether this is due to some anomolous event or if it is just a data entry error.

This plot shows the trend between total number of disasters and average land temperature. As in the previous plot, the points represent actual data points whereas the curve is a fitted regression model. It appears that the relationship between temperature and number of disasters follows a logistic trend defined by the following function:

\[\begin{equation} f(x) = \frac{L}{1 + e^{-k(x-x_0)}} \end{equation}\]

where

L is the curve’s maximum value

\(x_0\) is the x value of the function’s midpoint

k is the logistic growth rate\(^{[5]}\)

In essence, the number of disasters is roughly constantly low for temperatures on the lower end, increases roughly linearly for temperatures in the middle, and is roughly constantly high for temperatures on the higher end.

We attempt to model the data using two methods. The first plot above shows an attempt at using the sigmoid generating function in R. However, it does not generate the S-shaped curve that we are looking for. Nonetheless, it is not a bad fit for our data.

The second plot utilizes splines, which are locally generated cubics. We experimented with the parameter nknots for the R function smooth.spline and determined that a value of 6 produced the best fit. While the first plot was not bad, the method of splines generated a model that much more closely matches the S-shaped curve we were looking for.

The parameter estimates for the generated spline model are as follows:

  • Smoothing Parameter s=0.2129022 (\(\lambda\)=0.0016508)

  • Cubic Coefficients: -2.730562, 6.3536438, 24.3408847, 45.6597687, 121.6594541, 780.9746029, 848.1691568, 891.1093324

Time Series Analysis of Land Average Temperature

We obtain the ACF and PACF plots for the raw data and two different methods of time series generation. ACF and PACF plots can be used to determine the Moving Average (MA) and Auto-Regressive (AR) components, respectively, of a time series model. The ACF and PACF plots for the original data, “rough” part of the data, and first difference of the data (i.e. \(Y_t = X_t - X_{t-1}\), where \(X_t\) is the original data series) are displayed above. The first ACF/PACF plots for the original series show no discernable MA component and a possible AR component of order 2.

However, We can probably improve upon this model by analyzing three different methods of handling time series data. The first method involves subtracting off the “trend” part of the data and analyzing the remaining “rough” part of the data. From the ACF plot, it can be determined that the rough part has an MA component of order 3. The PACF plot gives no relevant information.

The second method involves taking the first difference of the data and then analyzing \(Y_t\). In this case, the PACF plot shows that this first difference series probably has an AR component of order 3 while the ACF plot gives no relevant information.

Finally, for the third method we utilize the R function auto.arima with parameter seasonal set to TRUE. This will automatically generate the “best” model according to the algorithm used in auto.arima, and will check for any seasonal components. Since we don’t need to calculate the order of the AR and MA components, as this is handled by R, we do not need the ACF and PACF plots.

Here we display the results of fitting the three previously described models against the historical data in the left column, and the forecasted data obtained from each model in the right column.

The auto-generated seasonal model produces a smoother plot than the rough plus trend and first difference models. This is not necessarily a good thing, however, because we would like to keep the individual fluctuations present in the original data when making new predictions. Looking at AIC values may help us decide which model is best for forecasting. The AIC values for the rough plus trend, first difference, and seasonal models are -115.8025208, -58.0940606, and -61.4049308, respectively. We choose the best model by choosing the model with the lowest AIC value - in this case the rough plus trend model.

The ten-year forecasts for the three models look significantly different. Rough plus trend shows variations only up to the third year and then simply follows a straight upward-sloping line, corresponding to the trend. This is due to the fact that MA models of order q only provide random variance up to the \(q^{th}\) forecasted year. The first difference forecast shows some amount of variance through all ten forecasted years while also seeming to follow the trend. This makes it seem promising in its predictive value. The seasonal model forecast is a constantly upward-sloping line which follows the trend but provides no random variance.

Based on the AIC criterion and the forecast plots, it seems that we can rule out the seasonal model. The first difference model appears to be best if we need a model that provides accurate yearly fluctuations for more than three years, whereas the rough plus trend model seems to be the most likely to give the most accurate average prediction.

We plot the fitted historical data for the past ten years and the forecasted data for the next ten years using the rough plus trend and first difference models above. The red line indicates the point at which we go from fitting historical data to forecasting future years. By zooming in on the data, we can see that both the fitted historical values and the forecasted values differ for the two models.

Categorization of Natural Disaster Type

As we can see from the plots above, \(CO_2\) emissions and land and land-and-ocean average temperatures have risen since 1990. The overall trend is the same for methane, but with a noticeable dip between 1990 and around 1996.

The data that we have available for \(\textbf{all}\) of these variables is on a shorter time period than for temperature alone (smaller sample size), so we must be sure to handle the data carefully and make sure no assumptions are violated in our analyses.

We create plots of all three predictor variables against one another in order to determine if there is multicollinearity. Strong multicollinearity appears to exist between land and land-and-ocean average temperatures and between \(CO_2\) and methane. Because of this, we should only keep one variable from each group; we choose to keep \(CO_2\) and land average temperature.

The two plots above are a representation of average land temperature and \(CO_2\) emissions data points for each natural disaster type. There isn’t a very clear way to categorize natural disasters by the temperature and \(CO_2\) emissions that we can observe by simply looking at these plots. Therefore we will require more analytical methods to determine a way to separate them.

One thing to notice from the color-coded plot of temperature is that it appears that, though usually directly proportional to emissions levels, there are some higher temperature points that occur near the middle of the emissions spectrum. Therefore, emissions cannot be the only driver of increases in global temperature.

Here we separate out the factors of \(CO_2\) emissions and average land temperature for each disaster type into two boxplots. For the first plot, it appears most disaster types have roughly similar median temperatures, though there are some deviations. We also observe a few outliers for the disaster types “Extreme temperature” (somewhat obviously), “Flood”, and “Insect infestation”.

For the second plot, the median \(CO_2\) levels do seem a bit more spread out than for temperature. It is important to mention, however, that while differences observed between medians among boxplots may indicate a noteable difference in the true mean values of the measured variable, they do not necessarily indicate a statistically significant difference.

We removed disaster categories with only 2 observations (Animal accidents and Impacts) and ran Henze-Zirkler tests on each of the disaster categories to test for multivariate normality and Box’s M test for equality of covariance. We find that none of the categories have normal distributions and they do not have equal covariance (p=0.05).

Since neither normality nor equal covariance holds, this rules out using Linear Discriminant Analysis (LDA). Quadratic Discriminant Analysis (QDA) is similar to LDA except that it doesn’t require the assumption of equal covariance and can be robust to non-normal data. Another method that could work well for non-normal unequal covariance data would be Support Vector Machines (SVM).

The QDA method involves finding the class, k, which maximizes the following function:

\[\begin{equation} \delta_k(x) = -\frac{1}{2}log|\Sigma_k| - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + log\pi_k \end{equation}\]

Where

\(\Sigma_k\) is the variance-covariance matrix for class k

\(\mu_k\) is the mean of class k

\(\pi_k\) is the prior probability of class k\(^{[6]}\)

Before beginning our classification of disaster type by average land temperature and \(CO_2\) emission level, we filter “Animal accident” and “Impact” out of our dataset because they have a very small number of observations. We then randomly split all of our roughly 20,000 observations into a training set with 70% of the observations and a test set with the remaining 30% of observations. We train the training set on a QDA model and get the following results:

Call:
qda(Type ~ LandAverageTemperature + co2, data = train)

Prior probabilities of groups:
             Drought           Earthquake             Epidemic 
         0.042493751          0.075283047          0.116379944 
Extreme temperature                 Flood   Insect infestation 
         0.047787090          0.360608734          0.002793707 
           Landslide  Mass movement (dry)                Storm 
         0.046904867          0.002646670          0.260476401 
   Volcanic activity             Wildfire 
         0.013968534          0.030657256 

Group means:
                     LandAverageTemperature      co2
Drought                            9.438884 28693.59
Earthquake                         9.397477 27978.79
Epidemic                           9.410260 27489.77
Extreme temperature                9.471152 29462.65
Flood                              9.460585 29186.72
Insect infestation                 9.280246 26417.09
Landslide                          9.407602 28294.25
Mass movement (dry)                9.232944 27014.59
Storm                              9.403640 28220.35
Volcanic activity                  9.401119 27968.85
Wildfire                           9.430425 27976.82

We then use our QDA model to predict the data from our test set and check our predictions against the actual observations. We obtain the following confusion matrix:

Confusion Matrix and Statistics

                      Reference
Prediction             Drought Earthquake Epidemic Extreme temperature  Flood
  Drought                    0          0        0                    0   238
  Earthquake                 0          0        0                    0   354
  Epidemic                   0          0        0                    0   642
  Extreme temperature        0          0        0                    0   244
  Flood                      0          0        0                    0  1926
  Insect infestation         0          0        0                    0    19
  Landslide                  0          0        0                    0   222
  Mass movement (dry)        0          0        0                    0     6
  Storm                      0          0        0                    0  1356
  Volcanic activity          0          0        0                    0    72
  Wildfire                   0          0        0                    0   145
                      Reference
Prediction             Insect infestation Landslide Mass movement (dry) Storm
  Drought                               0         0                   0    26
  Earthquake                            0         0                   0    50
  Epidemic                              0         0                   0    51
  Extreme temperature                   0         0                   0    14
  Flood                                 0         0                   0   193
  Insect infestation                    0         0                   0     3
  Landslide                             0         0                   0    40
  Mass movement (dry)                   0         0                   0     4
  Storm                                 0         0                   0   183
  Volcanic activity                     0         0                   0    18
  Wildfire                              0         0                   0    24
                      Reference
Prediction             Volcanic activity Wildfire
  Drought                              0        0
  Earthquake                           0        0
  Epidemic                             0        0
  Extreme temperature                  0        0
  Flood                                0        0
  Insect infestation                   0        0
  Landslide                            0        0
  Mass movement (dry)                  0        0
  Storm                                0        0
  Volcanic activity                    0        0
  Wildfire                             0        0

Overall Statistics
                                          
               Accuracy : 0.3617          
                 95% CI : (0.3494, 0.3742)
    No Information Rate : 0.8961          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0133          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: Drought Class: Earthquake Class: Epidemic
Sensitivity                      NA                NA              NA
Specificity                 0.95472            0.9307          0.8811
Pos Pred Value                   NA                NA              NA
Neg Pred Value                   NA                NA              NA
Prevalence                  0.00000            0.0000          0.0000
Detection Rate              0.00000            0.0000          0.0000
Detection Prevalence        0.04528            0.0693          0.1189
Balanced Accuracy                NA                NA              NA
                     Class: Extreme temperature  Class: Flood
Sensitivity                                   NA       0.3687
Specificity                              0.95575       0.6815
Pos Pred Value                                NA       0.9089
Neg Pred Value                                NA       0.1113
Prevalence                               0.00000       0.8961
Detection Rate                           0.00000       0.3304
Detection Prevalence                     0.04425       0.3635
Balanced Accuracy                             NA       0.5251
                     Class: Insect infestation Class: Landslide
Sensitivity                                 NA               NA
Specificity                           0.996226          0.95506
Pos Pred Value                              NA               NA
Neg Pred Value                              NA               NA
Prevalence                            0.000000          0.00000
Detection Rate                        0.000000          0.00000
Detection Prevalence                  0.003774          0.04494
Balanced Accuracy                           NA               NA
                     Class: Mass movement (dry) Class: Storm
Sensitivity                                  NA      0.30198
Specificity                            0.998285      0.74043
Pos Pred Value                               NA      0.11891
Neg Pred Value                               NA      0.90142
Prevalence                             0.000000      0.10395
Detection Rate                         0.000000      0.03139
Detection Prevalence                   0.001715      0.26398
Balanced Accuracy                            NA      0.52120
                     Class: Volcanic activity Class: Wildfire
Sensitivity                                NA              NA
Specificity                           0.98456         0.97101
Pos Pred Value                             NA              NA
Neg Pred Value                             NA              NA
Prevalence                            0.00000         0.00000
Detection Rate                        0.00000         0.00000
Detection Prevalence                  0.01544         0.02899
Balanced Accuracy                          NA              NA

We can see from the confusion matrix that the model predicted every test point to of type “Flood” or “Storm”, which led to a prediction accuracy of only 36.17%. There could be several reasons for this behavior including the number of observations from each class being widely different (imbalanced data) or the mean values of the two predictors being too similar among the classes.

For this reason, we chose to replicate the test with a subset of the original observations. We took 500 randomized observations from each of the disaster types “Flood”, “Wildfire”, and “Earthquake”. These were chosen because we hypothesized that they were events that would occur in different climatological settings (i.e. different mean temperature and \(CO_2\) levels) and they had enough observations to obtain a reasonably-sized and equal sample from each.

Repeating the test with this subset yielded the following results:

Confusion Matrix and Statistics

            Reference
Prediction   Earthquake Flood Wildfire
  Earthquake         12    46       93
  Flood              13    52       89
  Wildfire            7    34      104

Overall Statistics
                                          
               Accuracy : 0.3733          
                 95% CI : (0.3285, 0.4199)
    No Information Rate : 0.6356          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.066           
                                          
 Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                     Class: Earthquake Class: Flood Class: Wildfire
Sensitivity                    0.37500       0.3939          0.3636
Specificity                    0.66746       0.6792          0.7500
Pos Pred Value                 0.07947       0.3377          0.7172
Neg Pred Value                 0.93311       0.7297          0.4033
Prevalence                     0.07111       0.2933          0.6356
Detection Rate                 0.02667       0.1156          0.2311
Detection Prevalence           0.33556       0.3422          0.3222
Balanced Accuracy              0.52123       0.5366          0.5568

We can see that we now get predictions in all three categories but the prediction accuracy only increased by just over 1% to 37.33%. This is still not an optimal prediction accuracy, so other solutions to the classification problem should be attempted.

Since QDA didn’t give optimal results we attempt SVM in case the QDA model was not robust to the violation of normality of the data.

Confusion Matrix and Statistics

                      Reference
Prediction             Drought Earthquake Epidemic Extreme temperature  Flood
  Drought                    0          0        0                    0   220
  Earthquake                 0          0        0                    0   315
  Epidemic                   0          0        0                    0   625
  Extreme temperature        0          0        0                    0   230
  Flood                      0          0        0                    0  1856
  Insect infestation         0          0        0                    0    17
  Landslide                  0          0        0                    0   217
  Mass movement (dry)        0          0        0                    0     5
  Storm                      0          0        0                    0  1235
  Volcanic activity          0          0        0                    0    67
  Wildfire                   0          0        0                    0   139
                      Reference
Prediction             Insect infestation Landslide Mass movement (dry) Storm
  Drought                               0         0                   0    44
  Earthquake                            0         0                   0    89
  Epidemic                              0         0                   0    68
  Extreme temperature                   0         0                   0    28
  Flood                                 0         0                   0   263
  Insect infestation                    0         0                   0     5
  Landslide                             0         0                   0    45
  Mass movement (dry)                   0         0                   0     5
  Storm                                 0         0                   0   304
  Volcanic activity                     0         0                   0    23
  Wildfire                              0         0                   0    30
                      Reference
Prediction             Volcanic activity Wildfire
  Drought                              0        0
  Earthquake                           0        0
  Epidemic                             0        0
  Extreme temperature                  0        0
  Flood                                0        0
  Insect infestation                   0        0
  Landslide                            0        0
  Mass movement (dry)                  0        0
  Storm                                0        0
  Volcanic activity                    0        0
  Wildfire                             0        0

Overall Statistics
                                         
               Accuracy : 0.3705         
                 95% CI : (0.3581, 0.383)
    No Information Rate : 0.8449         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0344         
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: Drought Class: Earthquake Class: Epidemic
Sensitivity                      NA                NA              NA
Specificity                 0.95472            0.9307          0.8811
Pos Pred Value                   NA                NA              NA
Neg Pred Value                   NA                NA              NA
Prevalence                  0.00000            0.0000          0.0000
Detection Rate              0.00000            0.0000          0.0000
Detection Prevalence        0.04528            0.0693          0.1189
Balanced Accuracy                NA                NA              NA
                     Class: Extreme temperature  Class: Flood
Sensitivity                                   NA       0.3768
Specificity                              0.95575       0.7091
Pos Pred Value                                NA       0.8759
Neg Pred Value                                NA       0.1727
Prevalence                               0.00000       0.8449
Detection Rate                           0.00000       0.3184
Detection Prevalence                     0.04425       0.3635
Balanced Accuracy                             NA       0.5429
                     Class: Insect infestation Class: Landslide
Sensitivity                                 NA               NA
Specificity                           0.996226          0.95506
Pos Pred Value                              NA               NA
Neg Pred Value                              NA               NA
Prevalence                            0.000000          0.00000
Detection Rate                        0.000000          0.00000
Detection Prevalence                  0.003774          0.04494
Balanced Accuracy                           NA               NA
                     Class: Mass movement (dry) Class: Storm
Sensitivity                                  NA      0.33628
Specificity                            0.998285      0.74929
Pos Pred Value                               NA      0.19753
Neg Pred Value                               NA      0.86017
Prevalence                             0.000000      0.15506
Detection Rate                         0.000000      0.05214
Detection Prevalence                   0.001715      0.26398
Balanced Accuracy                            NA      0.54279
                     Class: Volcanic activity Class: Wildfire
Sensitivity                                NA              NA
Specificity                           0.98456         0.97101
Pos Pred Value                             NA              NA
Neg Pred Value                             NA              NA
Prevalence                            0.00000         0.00000
Detection Rate                        0.00000         0.00000
Detection Prevalence                  0.01544         0.02899
Balanced Accuracy                          NA              NA
Confusion Matrix and Statistics

            Reference
Prediction   Earthquake Flood Wildfire
  Earthquake         43    42       66
  Flood              34    67       53
  Wildfire           33    35       77

Overall Statistics
                                          
               Accuracy : 0.4156          
                 95% CI : (0.3696, 0.4626)
    No Information Rate : 0.4356          
    P-Value [Acc > NIR] : 0.816683        
                                          
                  Kappa : 0.1252          
                                          
 Mcnemar's Test P-Value : 0.001419        

Statistics by Class:

                     Class: Earthquake Class: Flood Class: Wildfire
Sensitivity                    0.39091       0.4653          0.3929
Specificity                    0.68235       0.7157          0.7323
Pos Pred Value                 0.28477       0.4351          0.5310
Neg Pred Value                 0.77592       0.7399          0.6098
Prevalence                     0.24444       0.3200          0.4356
Detection Rate                 0.09556       0.1489          0.1711
Detection Prevalence           0.33556       0.3422          0.3222
Balanced Accuracy              0.53663       0.5905          0.5626

Again, we only get observations classified as Flood and Storm when classifying all disaster types with only a less than 1% increase in accuracy from the QDA model to 37.05%. It should also be mentioned that due to SVM’s increased complexity with more categories of the prediction variable, the model takes quite a bit longer to train than the QDA model. However, the reduced dataset (500 observations each of Earthquake, Flood, and Wildfire) has 41.56% accuracy, an over 4% increase from the reduced QDA model, and doesn’t take significantly longer to train.

A final attempted solution is to artificially balance the data. We try both upsampling the minority classes to the size of the majority class and mixed upsampling and downsampling so that all classes have 500 observations. We get the following results:

Confusion Matrix and Statistics

                      Reference
Prediction             Drought Earthquake Epidemic Extreme temperature  Flood
  Drought                  136          0      715                  296   376
  Earthquake                71          0      553                  262   391
  Epidemic                  28          0      921                  254   349
  Extreme temperature       53          0      481                  619   410
  Flood                     99          0      488                  407   472
  Insect infestation         0          0      410                    0   212
  Landslide                 89          0      608                  233   537
  Mass movement (dry)       99          0      172                  100   307
  Storm                    105          0      519                  339   373
  Volcanic activity         96          0      516                  301   447
  Wildfire                  95          0      869                  326   275
                      Reference
Prediction             Insect infestation Landslide Mass movement (dry) Storm
  Drought                              85        39                 343     0
  Earthquake                          196        31                 420     0
  Epidemic                            111        82                 222     0
  Extreme temperature                 139        18                 234     0
  Flood                               133        58                 269     0
  Insect infestation                 1131         0                 312     0
  Landslide                           144       111                 264     0
  Mass movement (dry)                 106         0                1359     0
  Storm                               165        78                 413     0
  Volcanic activity                   169        75                 444     0
  Wildfire                             97        42                 199     0
                      Reference
Prediction             Volcanic activity Wildfire
  Drought                             60       56
  Earthquake                          65      124
  Epidemic                            19       56
  Extreme temperature                 40      121
  Flood                               50       95
  Insect infestation                   0        0
  Landslide                           60       91
  Mass movement (dry)                  0        0
  Storm                               62       67
  Volcanic activity                   99       30
  Wildfire                            92       95

Overall Statistics
                                         
               Accuracy : 0.2132         
                 95% CI : (0.208, 0.2186)
    No Information Rate : 0.2697         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.1353         
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: Drought Class: Earthquake Class: Epidemic
Sensitivity                0.156142                NA         0.14731
Specificity                0.911695           0.90884         0.93378
Pos Pred Value             0.064577                NA         0.45103
Neg Pred Value             0.965123                NA         0.74780
Prevalence                 0.037575           0.00000         0.26972
Detection Rate             0.005867           0.00000         0.03973
Detection Prevalence       0.090854           0.09116         0.08809
Balanced Accuracy          0.533919                NA         0.54055
                     Class: Extreme temperature  Class: Flood
Sensitivity                              0.19732      0.11376
Specificity                              0.92536      0.91598
Pos Pred Value                           0.29267      0.22791
Neg Pred Value                           0.88047      0.82581
Prevalence                               0.13533      0.17899
Detection Rate                           0.02670      0.02036
Detection Prevalence                     0.09124      0.08934
Balanced Accuracy                        0.56134      0.51487
                     Class: Insect infestation Class: Landslide
Sensitivity                            0.45679         0.207865
Specificity                            0.95489         0.910536
Pos Pred Value                         0.54770         0.051942
Neg Pred Value                         0.93630         0.979898
Prevalence                             0.10682         0.023037
Detection Rate                         0.04879         0.004789
Detection Prevalence                   0.08909         0.092192
Balanced Accuracy                      0.70584         0.559201
                     Class: Mass movement (dry) Class: Storm
Sensitivity                             0.30342           NA
Specificity                             0.95808       0.9085
Pos Pred Value                          0.63416           NA
Neg Pred Value                          0.85169           NA
Prevalence                              0.19323       0.0000
Detection Rate                          0.05863       0.0000
Detection Prevalence                    0.09245       0.0915
Balanced Accuracy                       0.63075           NA
                     Class: Volcanic activity Class: Wildfire
Sensitivity                          0.180987        0.129252
Specificity                          0.908187        0.911116
Pos Pred Value                       0.045475        0.045455
Neg Pred Value                       0.978670        0.969654
Prevalence                           0.023598        0.031708
Detection Rate                       0.004271        0.004098
Detection Prevalence                 0.093917        0.090164
Balanced Accuracy                    0.544587        0.520184
Confusion Matrix and Statistics

                      Reference
Prediction             Drought Earthquake Epidemic Extreme temperature  Flood
  Drought                   15          7       37                   17    22
  Earthquake                 9          5       36                   19    21
  Epidemic                   0          1       70                   15    24
  Extreme temperature        1          5       37                   30    35
  Flood                     12          2       39                   26    26
  Insect infestation         0          0       30                    0     9
  Landslide                  4          0       39                   18    36
  Mass movement (dry)        7         14       15                   11    15
  Storm                      5          7       39                   23    21
  Volcanic activity          8          5       34                    7    32
  Wildfire                   4          2       63                   10    19
                      Reference
Prediction             Insect infestation Landslide Mass movement (dry) Storm
  Drought                               8        14                  14     0
  Earthquake                           14         5                  21     0
  Epidemic                              9         9                  11     0
  Extreme temperature                   9        12                  19     0
  Flood                                10         5                  22     0
  Insect infestation                   74        10                  28     0
  Landslide                            10        14                  15     0
  Mass movement (dry)                   5         0                  92     0
  Storm                                11         7                  28     0
  Volcanic activity                     8         8                  27     0
  Wildfire                              8         5                  14     0
                      Reference
Prediction             Volcanic activity Wildfire
  Drought                              5        2
  Earthquake                          11        8
  Epidemic                             7        5
  Extreme temperature                  2       10
  Flood                                4        3
  Insect infestation                   0        0
  Landslide                            8        7
  Mass movement (dry)                  0        0
  Storm                                6        7
  Volcanic activity                    9        3
  Wildfire                            10       10

Overall Statistics
                                          
               Accuracy : 0.209           
                 95% CI : (0.1896, 0.2294)
    No Information Rate : 0.2659          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1287          
                                          
 Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                     Class: Drought Class: Earthquake Class: Epidemic
Sensitivity                0.230769          0.104167         0.15945
Specificity                0.920555          0.910168         0.93317
Pos Pred Value             0.106383          0.033557         0.46358
Neg Pred Value             0.966887          0.971372         0.75400
Prevalence                 0.039370          0.029073         0.26590
Detection Rate             0.009085          0.003028         0.04240
Detection Prevalence       0.085403          0.090248         0.09146
Balanced Accuracy          0.575662          0.507168         0.54631
                     Class: Extreme temperature  Class: Flood
Sensitivity                              0.17045      0.10000
Specificity                              0.91186      0.91157
Pos Pred Value                           0.18750      0.17450
Neg Pred Value                           0.90208      0.84421
Prevalence                               0.10660      0.15748
Detection Rate                           0.01817      0.01575
Detection Prevalence                     0.09691      0.09025
Balanced Accuracy                        0.54116      0.50579
                     Class: Insect infestation Class: Landslide
Sensitivity                            0.44578          0.15730
Specificity                            0.94815          0.91229
Pos Pred Value                         0.49007          0.09272
Neg Pred Value                         0.93867          0.95000
Prevalence                             0.10055          0.05391
Detection Rate                         0.04482          0.00848
Detection Prevalence                   0.09146          0.09146
Balanced Accuracy                      0.69697          0.53480
                     Class: Mass movement (dry) Class: Storm
Sensitivity                             0.31615           NA
Specificity                             0.95074      0.90672
Pos Pred Value                          0.57862           NA
Neg Pred Value                          0.86662           NA
Prevalence                              0.17626      0.00000
Detection Rate                          0.05572      0.00000
Detection Prevalence                    0.09631      0.09328
Balanced Accuracy                       0.63344           NA
                     Class: Volcanic activity Class: Wildfire
Sensitivity                          0.145161        0.181818
Specificity                          0.916929        0.915414
Pos Pred Value                       0.063830        0.068966
Neg Pred Value                       0.964901        0.970120
Prevalence                           0.037553        0.033313
Detection Rate                       0.005451        0.006057
Detection Prevalence                 0.085403        0.087826
Balanced Accuracy                    0.531045        0.548616

We get accuracies of 21.32% and 20.9% for the SVM models using upsampled and mixed up/downsampled data, respectively. These are significant reductions from all of the previous methods. We still see two disaster types (Earthquake and Storm) not predicted by the upsampled model and one type (Storm) not classified by the mixed up/downsampled model. The one benefit of using these models appears to be that many of the disaster types have non-null and reasonably high sensitivity and specificity values, which previous models including all disaster types did not. This means that the up/downsampled models may do a better job of differentiating between the disaster types.

Conclusion

In our analysis of the three continuous factors of number of disasters, year, and land average temperature we found all three to be positively related. Since the 1900s, global disasters have increased exponentially while temperature increased mostly linearly except between around 1945 and 1975. Perhaps this time period should be studied in more depth to understand what factors may have led temperatures to stop increasing. It is important to consider outliers and verify that data is complete for all of these factors. We found the spline method to be the best way to generate a model of the S-shaped regression curve of number of disasters over temperature.

We attempted three ways of modeling the time series of average land temperature: rough plus trend, first difference, and auto-generated seasonal. We got the best results with regard to AIC criterion from the rough plus trend model and the best looking forecast from the first difference model. It could be useful to re-attempt a seasonal model using a monthly level of granularity instead of yearly to capture actual variations due to the seasons (i.e. Fall, Winter, Spring, Summer).

Our categorization of disaster type based on land average temperature, \(CO_2\) emissions, and methane emissions proved difficult as there were problems of multicollinearity, non-normality of data, similarity of mean levels of predictors for each disaster type, and different numbers of each disaster type recorded. It would be helpful to look at more predictors that are less correlated with one another and for which the means for each disaster type differ more. These could include factors like geographical factors, wind speed, atmospheric measures, and sunlight.

The discoveries made in this report lead us to believe that more research must be done to discover which of these factors and other factors may be responsible for causing the increase in natural disasters, rather than just being correlated. It is our hope that the information in this report may provide a baseline for others to answer this question as well as figure out potential solutions to the problem.

Code Appendix

# Load libraries and data
library(datetime)
library(dplyr)
library(ggplot2)
library(sigmoid)
library(cowplot)
library(plotly)
library(forecast)
library(MASS)
library(caret)
library(nnet)
#library(survey)
library(effects)
library(tidyr)
library(MVN)
library(e1071)
library(imbalance)

knitr::opts_chunk$set(comment = NA)

disasters1.data <- read.csv('./data/DISASTERS/1900_2021_DISASTERS.xlsx - emdat data.csv')
disasters2.data <- read.csv('./data/DISASTERS/1970-2021_DISASTERS.xlsx - emdat data.csv')
temp.data <- read.csv('./data/GlobalTemperatures.csv')
emissions.data <- read.csv('data/owid-co2-data.txt', header = TRUE)
# Combine data from two disasters datasets
common <- match(colnames(disasters1.data), colnames(disasters2.data))
disasters2.data <- disasters2.data[,common]
disasters.data <- rbind(disasters1.data, disasters2.data)

# Histogram of all disasters per year
hist(disasters.data$Year, xlab = "Year", main = "Total Global Disasters per Year")
# Get data on CA fires
fire.data <- disasters.data[disasters.data$Disaster.Type == "Wildfire",]
row.condition <- Reduce(union, list(grep("cali", fire.data$Location, ignore.case = TRUE),
                 grep("los angeles", fire.data$Location, ignore.case = TRUE),
                 grep("Oakland", fire.data$Location),
                 grep("San Diego", fire.data$Location)))
cali.fires <- fire.data[row.condition,]

# Histogram of CA fires per year
hist(cali.fires$Year, xlab = "Year", main = "California Fires per Year")
# Get temperature data in desired format
temp.data.mod <- temp.data
temp.data.mod$dt <- as.date(temp.data.mod$dt, format = "%Y-%m-%d")
temp.data.mod <- temp.data.mod[temp.data.mod$dt > as.date("1899", format = "%Y"),]
temp.data.mod$dt <- as.numeric(format(temp.data.mod$dt, format = "%Y"))
temp.data.mod <- temp.data.mod %>% group_by(dt) %>% summarize(LandAverageTemperature = mean(LandAverageTemperature),
                                                              LandAverageTemperatureUncertainty = mean(LandAverageTemperatureUncertainty),
                                                              LandAndOceanAverageTemperature = mean(LandAndOceanAverageTemperature),
                                                              LandAndOceanAverageTemperatureUncertainty = mean(LandAndOceanAverageTemperatureUncertainty))

# Plot of average land temperature over time
ggplot(data = temp.data.mod, mapping = aes(x = dt, y = LandAverageTemperature)) +
  ylim(min(temp.data.mod$LandAverageTemperature), max(temp.data.mod$LandAverageTemperature)) +
  labs(title = "Land Average Temperature by Year", x = "Year", y = "Temperature") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point() + geom_smooth()
# Get number of disasters per year up to 2016
disasters.data.mod <- disasters.data
disasters.data.mod <- disasters.data.mod[disasters.data.mod$Year < 2016,]
disasters.data.mod <- disasters.data.mod %>% group_by(Year) %>% summarize(n = n())

# Join temperature and disaster datasets
temp.and.disasters <- dplyr::inner_join(temp.data.mod, disasters.data.mod, by = c("dt" = "Year"))

# Plot of number of disasters against average land temperature
ggplot(data = temp.and.disasters, mapping = aes(x = LandAverageTemperature, y = n)) +
  ylim(min(temp.and.disasters$n), max(temp.and.disasters$n)) +
  labs(title = "Number of Disasters for each Temperature", x = "Temperature", y = "Disasters") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point() + geom_smooth()
# Linear model with response number of disasters and predictor average land temperature using sigmoid
num.disasters.v.temp <- lm(n ~ sigmoid::sigmoid(LandAverageTemperature - 50, "logistic"), data = temp.and.disasters)
p1 <- ggplot(data = temp.and.disasters) +
  labs(title = "Land Average Temperature by Year (Fitted)", x = "Temperature", y = "Disasters") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_point(mapping = aes(x = LandAverageTemperature, y = n)) +
  geom_line(mapping = aes(x = LandAverageTemperature, y = num.disasters.v.temp$fitted.values), size = 1, color = 'blue')

# Fit using splines
spline.model <- smooth.spline(x = temp.and.disasters$LandAverageTemperature, y = temp.and.disasters$n, nknots = 6)
p2 <- ggplot(data = temp.and.disasters) +
  labs(x = "Temperature", y = "Disasters") +
  geom_point(mapping = aes(x = LandAverageTemperature, y = n)) +
  geom_line(mapping = aes(x = spline.model$x, y = spline.model$y), size = 1, color = 'blue')
plot_grid(p1, p2, ncol = 1)
## Time series projections ##

# Fit a trend model of land temperatures against time
temp.trend <- smooth.spline(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, nknots = 5)

# Plot ACF and PACF for raw data, first difference of data, and data with trend removed
orig.par <- par()
par(mfrow = c(3,2), mar = c(3,4,3,2))
acf(temp.and.disasters$LandAverageTemperature, main = "ACF (raw data)")
pacf(temp.and.disasters$LandAverageTemperature, main = "PACF (raw data)")

acf((temp.and.disasters$LandAverageTemperature - temp.trend$y), main = "ACF (rough part)")
pacf((temp.and.disasters$LandAverageTemperature - temp.trend$y), main = "PACF (rough part)")

acf(diff(temp.and.disasters$LandAverageTemperature, lag = 1), main = "ACF (first difference)")
pacf(diff(temp.and.disasters$LandAverageTemperature, lag = 1), main = "PACF (first difference)")
par(orig.par)
# Generate rough, first difference, and seasonal models
temp.ts.rough <- arima((temp.and.disasters$LandAverageTemperature - temp.trend$y), order = c(0,0,3))
temp.ts.diff <- arima(temp.and.disasters$LandAverageTemperature, order = c(3,1,0))
temp.ts.seas <- auto.arima(temp.and.disasters$LandAverageTemperature, seasonal = TRUE)

# Generate forecasts for the next 10 years from each model
temp.fc.rough <- forecast(temp.ts.rough, h = 10)
temp.fc.diff <- forecast(temp.ts.diff, h = 10)
temp.fc.seas <- forecast(temp.ts.seas, h = 10)
temp.fc.rpt <- temp.fc.rough$mean + predict(temp.trend$fit, 2016:2025)$y

# Plot the fitted models and forecasts
par(mfrow = c(3,2), mar = c(3,4,3,2))
plot(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, main = "Rough Plus Trend (Fitted)", xlab = "Time", ylab = "Temperature", type = "n")
lines(temp.and.disasters$dt, temp.and.disasters$LandAverageTemperature, col = "black")
lines(temp.and.disasters$dt, fitted(temp.ts.rough) + temp.trend$y, col = "blue", lwd = 2)
plot.ts(temp.fc.rpt, main = "Rough Plus Trend (Forecasts)", xlab = "Time", ylab = "Temperature")
plot(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, main = "First Difference (Fitted)", xlab = "Time", ylab = "Temperature", type = "n")
lines(temp.and.disasters$dt, temp.and.disasters$LandAverageTemperature, col = "black")
lines(temp.and.disasters$dt, fitted(temp.ts.diff), col = "blue", lwd = 2)
plot.ts(temp.fc.diff$mean, main = "First Difference (Forecasts)", xlab = "Time", ylab = "Temperature")
plot(x = temp.and.disasters$dt, y = temp.and.disasters$LandAverageTemperature, main = "Seasonal Model (Fitted)", xlab = "Time", ylab = "Temperature", type = "n")
lines(temp.and.disasters$dt, temp.and.disasters$LandAverageTemperature, col = "black")
lines(temp.and.disasters$dt, fitted(temp.ts.seas), col = "blue", lwd = 2)
plot.ts(temp.fc.seas$mean, main = "Seasonal Model (Forecasts)", xlab = "Time", ylab = "Temperature")
par(orig.par)
# Plot forecast and historical data (last 10 years) from assumed best models
combined.rough.ts <- ts(c(temp.trend$y[107:116]+fitted(temp.ts.rough)[107:116], temp.fc.rpt))
p1 <- ggplot() +
  labs(title = "Rough Plus Trend (Historical and Forecast)", x = "Year", y = "Temperature") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_line(mapping = aes(x = 2007:2026, y = combined.rough.ts)) +
  geom_vline(xintercept = 2017, color = 'red')

combined.diff.ts <- ts(c(fitted(temp.ts.diff)[107:116], temp.fc.diff$mean))
p2 <- ggplot() +
  labs(title = "First Difference (Historical and Forecast)", x = "Year", y = "Temperature") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_line(mapping = aes(x = 2007:2026, y = combined.diff.ts)) +
  geom_vline(xintercept = 2017, color = 'red')

plot_grid(p1, p2, ncol = 1)
## Predictors of types of natural disasters ##

# Modify emissions data to only include global data and select needed variables
emissions.data.mod <- emissions.data %>% dplyr::select(country, year, co2, methane) %>% filter(country == 'World')

# Plots of emissions/temperatures over time
title <- ggdraw() + draw_label("Temperature and Emissions Trends", fontface='bold')
p1 <- ggplot(data = emissions.data.mod[!is.na(emissions.data.mod$co2),]) +
  labs(x = "Year", y = "Emissions (million tonnes CO2)") +
  geom_smooth(mapping = aes(x = year, y = co2))
p2 <- ggplot(data = emissions.data.mod[!is.na(emissions.data.mod$methane),]) +
  labs(x = "Year", y = "Emissions (million tonnes methane)") +
  geom_smooth(mapping = aes(x = year, y = methane))
p3 <- ggplot(data = temp.data.mod) +
  labs(x = "Year", y = "Land Temperature") +
  geom_smooth(mapping = aes(x = dt, y = LandAverageTemperature))
p4 <- ggplot(data = temp.data.mod) +
  labs(x = "Year", y = "Land and Ocean Temp") +
  geom_smooth(mapping = aes(x = dt, y = LandAndOceanAverageTemperature))
p <- plot_grid(p1, p2, p3, p4, ncol = 2)
plot_grid(title, p, ncol = 1, rel_heights=c(0.4, 1))
# Combine natural disasters, temperature, and emissions datasets for years in which there is
# data for all three of them (1990-2016)
nat.disasters <- data.frame(Year = disasters.data$Year, Type = disasters.data$Disaster.Type) %>% filter(Year >= 1990 & Year <= 2016)
nat.disasters <- nat.disasters %>% left_join(temp.data.mod, by = c("Year" = "dt"), copy = TRUE) %>% left_join(emissions.data.mod, by = c("Year" = "year"), copy = TRUE)

# ANOVA on temperature, CO2, and methane
aov.temp <- aov(LandAverageTemperature ~ Type, data = nat.disasters, na.action = na.exclude)
tukey.temp <- TukeyHSD(aov.temp)
aov.co2 <- aov(co2 ~ Type, data = nat.disasters)
aov.methane <- aov(methane ~ Type, data = nat.disasters)

# Classification of disaster type based on temperature and CO2
pairs(nat.disasters[,c("LandAverageTemperature", "LandAndOceanAverageTemperature", "co2", "methane")])
# custom grid style
axx <- list(
  gridcolor='rgb(255, 255, 255)',
  zerolinecolor='rgb(255, 255, 255)',
  showbackground=TRUE,
  backgroundcolor='rgb(230, 230,230)'
)

create_plotly_fig <- function(data, cat, plotnum) {
  cat_data = data %>% filter(type == cat)
  cat_probs = matrix(cat_data$prob, nrow = 50, ncol = 50)
  fig = plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~cat_probs, scene = paste("scene", plotnum, sep="")) %>% add_surface(showscale = FALSE)
  return(fig)
}
  
create_scene <- function(cat, rid, cid, xcoord, ycoord, scene) {
  return(list(domain=list(x=xcoord, y=ycoord, row=rid, column=cid),
                   xaxis=c(axx, id=scene),
                   yaxis=c(axx, id=scene),
                   zaxis=c(axx, title=paste(cat, "Probability"), id=scene),
                   aspectmode='cube'))

}
# TODO: filter out only Climatological subgroup

model_3d = multinom(as.factor(Type) ~ LandAverageTemperature + co2, data = nat.disasters, trace = FALSE)

mintemp = min(nat.disasters$LandAverageTemperature, na.rm = TRUE)
maxtemp = max(nat.disasters$LandAverageTemperature, na.rm = TRUE)
minco2 = min(nat.disasters$co2, na.rm = TRUE)
maxco2 = max(nat.disasters$co2, na.rm = TRUE)
xygrid = expand.grid(LandAverageTemperature = seq(mintemp, maxtemp, len=50), co2 = seq(minco2, maxco2, len=50))

z = as.data.frame(predict(model_3d, newdata = xygrid, type="probs"))
z = pivot_longer(z, 1:ncol(z), names_to = "type", values_to = "prob")
logit_figs = rep(0, 13)
i <- 1

storm_data = z %>% filter(type == "Storm")
storm_probs = matrix(storm_data$prob, nrow = 50, ncol = 50)
storms = plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~storm_probs, scene = paste("scene", 1, sep="")) %>% add_surface(showscale = FALSE)

flood_data = z %>% filter(type == "Flood")
flood_probs = matrix(flood_data$prob, nrow = 50, ncol = 50)
floods = plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~flood_probs, scene = paste("scene", 2, sep="")) %>% add_surface(showscale = FALSE)

wildfire_data = z %>% filter(type == "Wildfire")
wildfire_probs = matrix(wildfire_data$prob, nrow = 50, ncol = 50)
wildfires = plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~wildfire_probs, scene = paste("scene", 3, sep="")) %>% add_surface(showscale = FALSE)

earthquake_data = z %>% filter(type == "Earthquake")
earthquake_probs = matrix(earthquake_data$prob, nrow = 50, ncol = 50)
earthquakes = plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~earthquake_probs, scene = paste("scene", 4, sep="")) %>% add_surface(showscale = FALSE)

# individual plots
fig1 <- plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~storm_probs, scene='scene1') 
fig1 <- fig1 %>% add_surface(showscale=FALSE)

fig2 <- plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~flood_probs, scene='scene2') 
fig2 <- fig2 %>% add_surface(showscale=FALSE)

fig3 <- plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~wildfire_probs, scene='scene3') 
fig3 <- fig3 %>% add_surface(showscale=FALSE)

fig4 <- plot_ly(x = seq(mintemp, maxtemp, len=50), y = seq(minco2, maxco2, len=50), z = ~earthquake_probs, scene='scene4') 
fig4 <- fig4 %>% add_surface(showscale=FALSE)

# subplot and define scene
fig <- subplot(fig1, fig2, fig3, fig4)
fig <- fig %>% layout(title = "Regressing Disasters on Temp and CO2",
         scene = list(domain=list(x=c(0,0.5),y=c(0.5,1)),
                      xaxis=c(axx, title="Temp"),
                      yaxis=c(axx, title="CO2"),
                      zaxis=c(axx, title="Storm Probability"),
                      aspectmode='cube'),
         scene2 = list(domain=list(x=c(0.5,1),y=c(0.5,1)),
                       xaxis=c(axx, title="Temp"),
                       yaxis=c(axx, title="CO2"),
                       zaxis=c(axx, title="Flood Probability"),
                       aspectmode='cube'),
         scene3 = list(domain=list(x=c(0,0.5),y=c(0,0.5)),
                       xaxis=c(axx, title="Temp"),
                       yaxis=c(axx, title="CO2"),
                       zaxis=c(axx, title="Wildfire Probability"),
                       aspectmode='cube'),
         scene4 = list(domain=list(x=c(0.5,1),y=c(0,0.5)),
                       xaxis=c(axx, title="Temp"),
                       yaxis=c(axx, title="CO2"),
                       zaxis=c(axx, title="Earthquake Probability"),
                       aspectmode='cube'))

fig

#plot_ly(data = nat.disasters, x = nat.disasters$LandAverageTemperature, y = nat.disasters$co2, z = nat.disasters$methane, type = "scatter3d") %>% add_surface(z = as.matrix(nat.disasters$methane))
ggplot(data = nat.disasters) +
  labs(title = "Temperature and CO2 Emissions for Disaster Type", x = "Disaster Type", y = "Million Tonnes CO2") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 9)) +
  geom_point(mapping = aes(x = Type, y = co2, colour = LandAverageTemperature), size = 5) +
  scale_color_gradient(low = "yellow", high = "red")
fig1 <- create_plotly_fig(z, "Animal accident", 1)
fig2 <- create_plotly_fig(z, "Drought", 2)
fig3 <- create_plotly_fig(z, "Earthquake", 3)
fig4 <- create_plotly_fig(z, "Epidemic", 4)
fig5 <- create_plotly_fig(z, "Extreme temperature ", 5)
fig6 <- create_plotly_fig(z, "Flood", 6)
fig7 <- create_plotly_fig(z, "Impact", 7)
fig8 <- create_plotly_fig(z, "Insect infestation", 8)
fig9 <- create_plotly_fig(z, "Landslide", 9)
fig10 <- create_plotly_fig(z, "Mass movement (dry)", 10)
fig11 <- create_plotly_fig(z, "Storm", 11)
fig12 <- create_plotly_fig(z, "Volcanic activity", 12)
fig13 <- create_plotly_fig(z, "Wildfire", 13)

# subplot and define scene
fig <- subplot(fig1, fig2, fig3, fig4, fig5, fig6, fig7, fig8, fig9, fig10, fig11, fig12, fig13, nrows = 4)
fig <- fig %>% layout(title = "Regressing Disasters on Temp and CO2",
         scene = create_scene("Animal accident", 0, 0, c(0,0.5), c(1.5,2), 1),
         scene2 = create_scene("Drought", 0, 1, c(0.5,1), c(1.5,2), 2),
         scene3 = create_scene("Earthquake", 0, 2, c(1,1.5), c(1.5,2), 3),
         scene4 = create_scene("Epidemic", 0, 3, c(1.5,2), c(1.5,2), 4),
         scene5 = create_scene("Extreme temperature", 1, 0, c(0,0.5), c(1,1.5), 5),
         scene6 = create_scene("Flood", 1, 1, c(0.5,1), c(1,1.5), 6),
         scene7 = create_scene("Impact", 1, 2, c(1,1.5), c(1,1.5), 7),
         scene8 = create_scene("Insect infestation", 1, 3, c(1.5,2), c(1,1.5), 8),
         scene9 = create_scene("Landslide", 2, 0, c(0,0.5), c(0.5,1), 9),
         scene10 = create_scene("Mass movement (dry)", 2, 1, c(0.5,1), c(0.5,1), 10),
         scene11 = create_scene("Storm", 2, 2, c(1,1.5), c(0.5,1), 11),
         scene12 = create_scene("Volcanic activity", 2, 3, c(1.5,2), c(0.5,1), 12),
         scene13 = create_scene("Wildfire", 3, 0, c(0.75,1.25), c(0,0.5), 13)
         )
fig
# Boxplots of temperatures and CO2 levels at which disasters occur
p1 <- ggplot(data = nat.disasters) +
  geom_boxplot(mapping = aes(x = Type, y = LandAverageTemperature)) +
  theme(axis.text.x = element_text(size = 5))
p2 <- ggplot(data = nat.disasters) +
  geom_boxplot(mapping = aes(x = Type, y = co2)) +
  theme(axis.text.x = element_text(size = 5))
plot_grid(p1, p2, ncol = 1)
# Obtained from https://stackoverflow.com/questions/49719660/r-markdown-to-pdf-printing-console-output
print_output <- function(output, cex = 0.5) {
  tmp <- capture.output(output)
  plot.new()
  text(0, 1, paste(tmp, collapse='\n'), adj = c(0,1), family = 'mono', cex = cex)
  box()
}
# CO2 and methane appear to be multicollinear and LandAverageTemperature and CO2 are both NOT normally distributed.  Therefore, use qda.
nat.disasters.mod <- nat.disasters %>% filter(!(Type %in% c("Animal accident", "Impact"))) %>% dplyr::select(LandAverageTemperature, co2, Type) %>% drop_na()

mvns = c()
for (t in unique(nat.disasters.mod$Type)) {
  d <- nat.disasters.mod %>% filter(Type == t) %>% dplyr::select(LandAverageTemperature, co2)
  res <- mvn(data = d, mvnTest = "hz")
  mvns <- append(mvns, res$multivariateNormality["MVN"])
}
names(mvns) <- unique(nat.disasters.mod$Type)
set.seed(123)
train.ind <- sample(1:nrow(nat.disasters.mod), size = 0.7 * nrow(nat.disasters.mod), replace = FALSE)
train <- nat.disasters.mod[train.ind,]
test <- nat.disasters.mod[-train.ind,]
temp.co2.qda <- qda(Type ~ LandAverageTemperature + co2, data = train)

#print_output(temp.co2.qda)
temp.co2.qda
pred.qda <- predict(temp.co2.qda, test)
conf.qda <- confusionMatrix(as.factor(test$Type), pred.qda$class)

#print_output(conf.qda, cex = 0.4)
conf.qda
set.seed(456)
nat.disasters.mod2 <- nat.disasters.mod %>% filter((Type %in% c("Flood", "Wildfire", "Earthquake"))) %>% dplyr::select(LandAverageTemperature, co2, Type) %>% group_by(Type) %>% drop_na() %>% slice_sample(n = 500)

train.ind.red <- sample(1:nrow(nat.disasters.mod2), size = 0.7 * nrow(nat.disasters.mod2), replace = FALSE)
train.red <- nat.disasters.mod2[train.ind.red,]
test.red <- nat.disasters.mod2[-train.ind.red,]
temp.co2.qda <- qda(Type ~ LandAverageTemperature + co2, data = train.red)

pred.qda <- predict(temp.co2.qda, test.red)
conf.qda <- confusionMatrix(as.factor(test.red$Type), pred.qda$class)

#print_output(conf.qda, cex = 0.4)
conf.qda
temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train)
pred.svm <- predict(temp.co2.svm, test)
conf.svm <- confusionMatrix(as.factor(test$Type), pred.svm)
#print_output(conf.svm, cex = 0.4)
conf.svm

temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train.red)
pred.svm <- predict(temp.co2.svm, test.red)
conf.svm <- confusionMatrix(as.factor(test.red$Type), pred.svm)
#print_output(conf.svm, cex = 0.4)
conf.svm
set.seed(789)
nat.disasters.mod3 <- nat.disasters.mod %>% dplyr::select(LandAverageTemperature, co2, Type) %>% drop_na()

## Up Sampling

balanced.disasters.up <- nat.disasters.mod3
max_num <- max(count(nat.disasters.mod3, Type)$n)

for (t in unique(nat.disasters.mod3$Type)) {
  diff <- max_num - sum(nat.disasters.mod3$Type == t)
  
  if (diff > 0) {
    balanced.disasters.up <- rbind(balanced.disasters.up, slice_sample(filter(nat.disasters.mod3, nat.disasters.mod3$Type == t), n = diff, replace = TRUE))
  }
}

train.ind.up <- sample(1:nrow(balanced.disasters.up), size = 0.7 * nrow(balanced.disasters.up), replace = FALSE)
train.up <- balanced.disasters.up[train.ind.up,]
test.up <- balanced.disasters.up[-train.ind.up,]

temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train.up)
pred.svm <- predict(temp.co2.svm, test.up)
conf.svm <- confusionMatrix(as.factor(test.up$Type), pred.svm)
conf.svm

## Down Sampling

balanced.disasters.down <- data.frame()

for (t in unique(nat.disasters.mod3$Type)) {
  if (count(filter(nat.disasters.mod3, nat.disasters.mod3$Type == t)) < 500) {
    balanced.disasters.down <- rbind(balanced.disasters.down, slice_sample(filter(nat.disasters.mod3, nat.disasters.mod3$Type == t), n = 500, replace = TRUE))
  }
  else {
    balanced.disasters.down <- rbind(balanced.disasters.down, slice_sample(filter(nat.disasters.mod3, nat.disasters.mod3$Type == t), n = 500, replace = FALSE))
  }
}

train.ind.down <- sample(1:nrow(balanced.disasters.down), size = 0.7 * nrow(balanced.disasters.down), replace = FALSE)
train.down <- balanced.disasters.down[train.ind.down,]
test.down <- balanced.disasters.down[-train.ind.down,]

temp.co2.svm <- svm(as.factor(Type) ~ LandAverageTemperature + co2, data = train.down)
pred.svm <- predict(temp.co2.svm, test.down)
conf.svm <- confusionMatrix(as.factor(test.down$Type), pred.svm)
conf.svm
#print_output(conf.svm, cex = 0.4)